Using Loglinear Clustering for Subcategorization Identification

نویسندگان

  • Nuno Miguel Marques
  • José Gabriel Pereira Lopes
  • Carlos Agra Coelho
چکیده

In this paper we will describe a process for mining syntactical verbal subcategorization, i.e. the information about the kind of phrases or clauses a verb goes with. We will use a large text corpus having almost 10,000,000 tagged words as our resource material. Loglinear modeling is used to analyze and automatically identify the subcategorization dependencies. An unsupervised clustering algorithm is used to accurately determine verbal subcategorization frames. In this paper we just tackle verbal subcategorization of noun phrases and prepositional phrases. A sample of 81 Portuguese verbs was used for evaluation purposes 97% precision and 99% recall for noun phrases and 92% precision and 100% recall for prepositional phrases was obtained.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Advancing Linguistic Features and Insights by Label-informed Feature Grouping: An Exploration in the Context of Native Language Identification

We propose a hierarchical clustering approach designed to group linguistic features for supervised machine learning that is inspired by variationist linguistics. The method makes it possible to abstract away from the individual feature occurrences by grouping features together that behave alike with respect to the target class, thus providing a new, more general perspective on the data. On the ...

متن کامل

Clustering Polysemic Subcategorization Frame Distributions Semantically

Previous research has demonstrated the utility of clustering in inducing semantic verb classes from undisambiguated corpus data. We describe a new approach which involves clustering subcategorization frame (SCF) distributions using the Information Bottleneck and nearest neighbour methods. In contrast to previous work, we particularly focus on clustering polysemic verbs. A novel evaluation schem...

متن کامل

A Corpus-based Conceptual Clustering Method for Verb Frames and Ontology Acquisition

We describe in this paper the ML system, ASIUM, which learns subcategorization frames of verbs and ontologies from syntactic parsing of technical texts in natural language. The restrictions of selection in the subcategorization frames are filled by the concepts of the ontology. Applications requiring subcategorization frames and ontologies are crucial and numerous. The most direct applications ...

متن کامل

Learning Verbal Transitivity Using LogLinear Models

Portuguese dictionaries lack information about the kind of phrases or clauses a verb goes with. In this paper we describe how this information can be computationally learned from an automatically tagged corpus with almost 10,000,000 words. Loglinear modeling for categorical data will be used to analyze and automatically identify the subcatego-rization dependencies. Loglinear models were also us...

متن کامل

Automatic Extraction of Subcategorization Frames for Czech

We present some novel machine learning techniques for the identification of subcategorization information for verbs in Czech. We compare three different statistical techniques applied to this problem. We show how the learning algorithm can be used to discover previously unknown subcategorization frames from the Czech Prague Dependency Treebank. The algorithm can then be used to label dependents...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 1998